ABSTRACT

Linear microphone arrays can be employed for acoustic
event localization in a noisy environment using time delay
estimation. Three techniques that allow delay estimation
are investigated, namely Normalized Cross Correlation, LMS
Adaptive Filters and Crosspower-Spectrum Phase; they are
combined with a bidimensional representation, the Coherence
Measure, in order to emphasize information that can be
exploited for estimating the position of both non-moving and
moving acoustic sources. To compare the given techniques,
different acoustic sources were considered, which generated
events in different positions in space. Expressing
performance in terms of accuracy of the wavefront direction
angle, experiments showed that the technique based on
Crosspower-Spectrum Phase outperforms the other two. This
technique also provided very promising preliminary results
in terms of source position estimation.



1. INTRODUCTION 

In the last decade, some research effort has been devoted to
microphone array processing techniques [1], especially for
teleconferencing and large room recording [2], but also for
speech recognition [3].

Recently, the use of microphone array technology for
acoustic surveillance purposes has been considered*: the
objective of this activity is detection of acoustic events (e.g. 
explosions, screams, etc.) that can occur in a given envi- 
ronment, as well as localization of the acoustic source that 
generated them. This paper will investigate the problem of 
source localization, when a linear microphone array (con- 
sisting of four omnidirectional microphones) is used for ac- 
quisition of such events in a real noisy environment. 

From a theoretical point of view, the signals acquired by 
each microphone can be assumed to be delayed replicas of 
the source signal plus noise: localizing the sound source is 
equivalent to estimating the time delays between the signals 
received. Once the delays are known, the acoustic event
direction can be derived using geometry.
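
As an illustration of this geometric step, the following minimal sketch
converts a delay into a wavefront arrival angle under a far-field (plane
wave) assumption, which the text does not state explicitly. The function
name, the speed of sound of 343 m/s and the convention of measuring the
angle from the array normal are assumptions made for this example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at about 20 degrees C


def arrival_angle_deg(delay_s, mic_distance_m, c=SPEED_OF_SOUND):
    """Far-field arrival angle (degrees from the array normal) for one pair."""
    # The path difference divided by the spacing is the sine of the angle.
    # Clamp it, since a noisy delay estimate can exceed the physical limit.
    sine = max(-1.0, min(1.0, c * delay_s / mic_distance_m))
    return math.degrees(math.asin(sine))


# A delay equal to half the maximum possible one for a 15 cm pair.
angle = arrival_angle_deg(0.5 * 0.15 / SPEED_OF_SOUND, 0.15)
```

The sign of the angle depends on which microphone of the pair is taken
as the reference, so it has to be matched to the delay convention in use.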

The localization technique described in this work consists
in determining the source position as the crossing point
between directions estimated from the signals acquired by
microphone pairs. This method requires very accurate time
delay estimation but proves the most effective, as described
in [8], when the acquisition array consists of a few
microphones.

Three different time delay estimation techniques, namely
Normalized Cross Correlation (NCC), LMS Adaptive Filters
(LMS) and Crosspower-Spectrum Phase (CSP), were compared
using both real environment signals and synthetically
delayed and distorted replicas. Information on relative
delays is summarized and visualized using a meaningful
representation called Coherence Measure.

*This work was partially supported by the ESPRIT 5345
DIMUS project, where a system is being developed for
surveillance of underground stations.

Results in terms of angle as well as source position accu- 
racy showed a definite superiority of the CSP-based tech- 
nique. 

2. GEOMETRICAL MODEL 

This section introduces the general sound source model, for
a two-dimensional geometry and a linear microphone array
consisting of M acoustic sensors. The geometry of the array
is represented by the sensor positions
$p_0(x_0, y_0), \ldots, p_{M-1}(x_{M-1}, y_{M-1})$. We assume
that an acoustic source located in position $(x_s, y_s)$
(see Figure 1) generates an acoustic event $r(t)$ that is
acquired by microphones $0, \ldots, M-1$ as signals
$s_0(t), \ldots, s_{M-1}(t)$.



Figure 1. Wavefront propagation of an acoustic stimulus
generated in position $(x_s, y_s)$. Signals $s_0$, $s_1$,
$s_2$, $s_3$ are acquired through an array of microphones
placed in positions $p_0$, $p_1$, $p_2$, $p_3$. The wavefront
reaches microphones 1, 2, 3 with delays $\delta_{01}$,
$\delta_{02}$, $\delta_{03}$ with respect to microphone 0.

For the given source signal $r(t)$, propagated in a generic
noisy environment, the signal acquired by the acoustic
sensor $i$ can be expressed as follows:

$$s_i(t) = \alpha_i \, r(t - \tau_i) + n_i(t) \qquad (1)$$

where $\alpha_i$ is an attenuation factor due to propagation
effects, $\tau_i$ is the propagation time and $n_i(t)$
includes all the contaminating noises, which are assumed to
be uncorrelated with $r(t)$. We also indicate with
$\delta_{ij}(x, y)$ the relative delay of wavefront arrival
between microphones $i$ and $j$, assuming a source in
position $(x, y)$, and in particular:

$$\delta_{ij} \equiv \delta_{ij}(x_s, y_s) = (\tau_j - \tau_i). \qquad (2)$$
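
The relative delays in (2) follow directly from the source and sensor
positions. The sketch below computes them for a hypothetical array of
four microphones with 15 cm spacing, centered at the origin along the x
axis; the array placement, the speed of sound and the helper name are
assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def relative_delays(source, mics, c=SPEED_OF_SOUND):
    """Return delta_0j = tau_j - tau_0 in seconds for every microphone j."""
    # tau_j is the propagation time from the source to microphone j.
    taus = [math.dist(source, m) / c for m in mics]
    return [t - taus[0] for t in taus]


# Hypothetical four-microphone linear array, 15 cm between neighbours.
mics = [(-0.225, 0.0), (-0.075, 0.0), (0.075, 0.0), (0.225, 0.0)]
delays = relative_delays((0.0, 2.0), mics)
```

For a source on the array axis of symmetry, as here, the outer
microphones receive the wavefront at the same time and the inner ones
slightly earlier.
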

3. COHERENCE MEASURE

As pointed out above, different techniques can be exploited
for time delay estimation. Information on the mutual delay
between signals can be cast into a representation called
Coherence Measure (CM), associated with a function
$C_{ij}(t, \tau)$ that expresses, for a given delay $\tau$,
the similarity between segments (centered at the time
instant $t$) extracted from two generic signals $s_i$ and
$s_j$. It is expected to have a prominent peak at the delay
$\tau = \delta_{ij}$, corresponding to the direction of
wavefront arrival. For each pair of microphones, and for
each technique, a bidimensional CM representation can be
conceived as shown in Figure 2: in this representation the
horizontal axis refers to time, the vertical axis refers to
delay and the coherence magnitude is represented through a
grey scale. The following examples show CMs evaluated with
an analysis rate of 10.65 ms.

Both for moving and for non-moving sources, this CM
can be exploited to derive the source position. When the
acoustic source is moving, the CM maximum should trace
a curve that follows the contour of the theoretical delay
for the given microphone pair $(i, j)$. For non-moving
sources the CM representation is characterized by a line at
the theoretical delay; hence, the delay can be easily
extracted.

4. ANALYSIS TECHNIQUES 

Different methods can be conceived to derive the CM
representation. In this section, three techniques are
described, namely: Normalized Cross Correlation, LMS
Adaptive Filters and Crosspower-Spectrum Phase.

4.1. Normalized Cross Correlation
The most common method of determining the generic time
delay $\delta_{ij}$ and the corresponding arrival angle,
given two signals $s_i(t)$ and $s_j(t)$, requires estimating,
for every delay $\tau$, the cross correlation function:

$$R_{ij}(\tau) = E[s_i(t) \, s_j(t + \tau)] \qquad (3)$$

where $E$ denotes expectation. Given the model (1), (3) can
be expressed as:

$$R_{ij}(\tau) = \alpha_i \alpha_j R_{rr}(\tau - \delta_{ij}) + R_{n_i n_j}(\tau) \qquad (4)$$

where $R_{rr}(\tau)$ represents the autocorrelation of the
source signal $r(t)$, evaluated at lag $\tau$, and
$R_{n_i n_j}(\tau)$ is the cross correlation of the noise
components. $\delta_{ij}$ can be theoretically derived by
maximizing this function with respect to $\tau$. However,
due to the finite observation time, (3) can only be
estimated for a given temporal window of length $T$,
centered at time $t$; we denote this estimate as:

$$\hat{R}_{ij}(t, \tau) = \frac{1}{T} \int_{t - T/2}^{t + T/2} s_i(u) \, s_j(u + \tau) \, du. \qquad (5)$$

When dealing with real noisy signals, the simple maxi- 
mization based approach applied to (5) can easily fail, due 
both to signal properties (dependent also on microphone 
characteristics) and to limits of the mathematical model. 
A reasonable improvement [5] consists in the normalization 
of (5) with respect to the signal energies, leading to: 

$$\rho_{ij}(t, \tau) = \frac{\int_{t - T/2}^{t + T/2} s_i(u) \, s_j(u + \tau) \, du}{\sqrt{\int_{t - T/2}^{t + T/2} s_i^2(u) \, du \, \int_{t - T/2}^{t + T/2} s_j^2(u + \tau) \, du}}. \qquad (6)$$

In the following, we will refer to a first type of Coherence
Measure defined as:

$$C^{(1)}_{ij}(n, l) = \rho_{ij}(n, l) \qquad (7)$$

which is the digital counterpart of (6), with $n$ the time
index and $l$ the lag, both expressed in samples.

Figure 2. Coherence Measure representation based on the
NCC (a), the LMS (b) and the CSP (c) analyses between
signals $s_0$ (plotted in the upper part of the Figure) and
$s_1$: the acoustic stimulus was a vowel "a", uttered in a
noisy environment by a speaker who was walking from the left
to the right of the microphone pair.

In an ideal situation, when (1) is simplified by imposing
both that $\alpha_i = 1$ for every $i$ (i.e. there is no
attenuation) and that the noise components $n_i(t)$ are
uncorrelated, $\rho_{ij}(n, l)$ has a peak equal to 1 for
the lag corresponding to the delay $\delta_{ij}$. But, if
$r(t)$ is a periodic signal with period $T_p = P T_s$
(denoting the sampling period by $T_s$), $\rho_{ij}(n, l)$
contains other unitary peaks, for each lag
$\delta_{ij} + k T_p$ (with $k$ integer): this property can
be traced back to the periodicity of $R_{rr}(\tau)$.
Microperiodicities (e.g. due to the formant structure of
speech vowels) also contribute to other misleading peaks.

Further, [5] pointed out other problems that arise when
applying a simple peak-picking algorithm to the CM expressed
by (7): a critical issue is the window length $T$; in the
following we will refer to the use of a window length of
21.3 ms (1024 samples at $F_s = 48$ kHz), which proved a
reasonable compromise between complexity and performance,
especially for moving sources.
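
To make the NCC-based Coherence Measure concrete, the sketch below
computes one CM column (one analysis frame, all lags) in plain Python.
It assumes that the normalization divides the windowed cross correlation
by the square root of the product of the two window energies; the
function name, its arguments and the short test signal are illustrative
and are not taken from the original system.

```python
import math
import random


def ncc_frame(si, sj, center, half_win, max_lag):
    """Normalized cross correlation of s_i and s_j around sample `center`.

    Returns a dict mapping each lag in [-max_lag, max_lag] to a value in
    [-1, 1]. The caller must keep center - half_win - max_lag >= 0 and
    center + half_win + max_lag <= len(sj).
    """
    lo, hi = center - half_win, center + half_win
    seg_i = si[lo:hi]
    energy_i = sum(x * x for x in seg_i)
    column = {}
    for lag in range(-max_lag, max_lag + 1):
        seg_j = sj[lo + lag:hi + lag]          # window of s_j shifted by lag
        cross = sum(a * b for a, b in zip(seg_i, seg_j))
        energy_j = sum(x * x for x in seg_j)
        denom = math.sqrt(energy_i * energy_j)
        column[lag] = cross / denom if denom > 0.0 else 0.0
    return column


rng = random.Random(0)
s0 = [rng.gauss(0.0, 1.0) for _ in range(400)]
s1 = [0.0] * 3 + s0[:-3]                       # s1 is s0 delayed by 3 samples
cm_column = ncc_frame(s0, s1, center=200, half_win=64, max_lag=8)
best_lag = max(cm_column, key=cm_column.get)
```

With a noise-like test signal the peak is unambiguous; with a periodic
signal the same code would show the additional unitary peaks discussed
above.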

4.2. LMS Adaptive Filters 

The LMS Adaptive Filter [6] is a Finite Impulse Response
(FIR) filter that automatically adapts its coefficients to
minimize the mean square difference between its two inputs.
It does not require any a priori knowledge of the input
spectra. The structure consists of two input signals: a
reference signal $s_j(t)$ and a desired signal $s_i(t)$; both
signals can be modeled as (1) (their samples at time $nT_s$
will be denoted by $s_i(n)$ and $s_j(n)$). The LMS Adaptive
Filter output is based on the following formula:

$$y_{ij}(n) = W_{ij}^T(n) \, X_{ij}(n) \qquad (8)$$

where $T$ denotes transpose and
$X_{ij}(n) = [s_j(n), s_j(n-1), \ldots, s_j(n-L+1)]^T$ is the
filter state, consisting of the $L$ most recent samples of
the reference signal. The vector $W_{ij}(n)$ is the
$L$-vector of filter weights at instant $n$. The error
output $e_{ij}(n)$ is computed as follows:

$$e_{ij}(n) = s_i(n) - W_{ij}^T(n) \, X_{ij}(n). \qquad (9)$$

The weight vector is updated every sample:

$$W_{ij}(n+1) = W_{ij}(n) + \mu \, e_{ij}(n) \, X_{ij}^*(n) \qquad (10)$$

where $*$ denotes the complex conjugate and $\mu$ is a
feedback coefficient that controls the rate of convergence
and the algorithm stability.

The algorithm adapts the FIR filter to insert a delay equal 
and opposite to that existing between the two signals: in an 
ideal situation, the filter weight corresponding to the true 
delay would be unity and all other weights would be zero. 

For our purposes and due to the properties of the weight
vector $W_{ij}(n)$, we define a second type of CM as:

$$C^{(2)}_{ij}(n, l) = w_{ij}(n, l) \qquad (11)$$

where $w_{ij}(n, l_0)$ is the component of $W_{ij}(n)$ for
lag $l_0$.
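
The recursion (8)-(10) can be sketched in a few lines for real-valued
signals, where the complex conjugate in (10) has no effect. The function
name, the number of taps and the value of the feedback coefficient are
assumptions chosen for this toy example. Note that a causal filter of
this kind only represents delays for which the desired signal lags the
reference; handling both signs would need an extra fixed delay on the
desired signal, which is not shown here.

```python
import random


def lms_weights(desired, reference, num_taps, mu):
    """Run a real-valued LMS filter and return the final weight vector.

    desired   -- samples of s_i, the signal the filter output should match
    reference -- samples of s_j, the filter input
    """
    w = [0.0] * num_taps
    for n in range(num_taps - 1, len(desired)):
        # Filter state: the num_taps most recent reference samples.
        x = [reference[n - k] for k in range(num_taps)]
        y = sum(wk * xk for wk, xk in zip(w, x))           # equation (8)
        e = desired[n] - y                                 # equation (9)
        w = [wk + mu * e * xk for wk, xk in zip(w, x)]     # equation (10)
    return w


rng = random.Random(1)
s_j = [rng.gauss(0.0, 1.0) for _ in range(3000)]
s_i = [0.0] * 4 + s_j[:-4]                 # s_i lags s_j by 4 samples
weights = lms_weights(s_i, s_j, num_taps=16, mu=0.01)
estimated_lag = max(range(len(weights)), key=lambda k: abs(weights[k]))
```

In this noise-free case the weight at the true delay approaches unity
and the others approach zero, which is the ideal behaviour described in
the text.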

4.3. Crosspower-Spectrum Phase
Starting from a mathematical model similar to (1), [7]
proposed a maximum likelihood estimator for determining
time delays between two signals $s_i$ and $s_k$. Prefiltering
the signals before computing the correlation leads to the
so-called Generalized Cross Correlation method. Basically,
the Fourier transform of (4) provides the
Crosspower-Spectrum:



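$$G_{ik}(f) = \alpha_i \alpha_k \, G_{rr}(f) \, e^{-j 2 \pi f \delta_{ik}} + G_{n_i n_k}(f) \qquad (12)$$

where $G_{rr}(f)$ and $G_{n_i n_k}(f)$ are the Fourier
transforms of $R_{rr}(\tau)$ and of the noise cross
correlation, respectively.

The Generalized Cross Correlation between $s_i(t)$ and
$s_k(t)$ is defined as:

$$R^{(g)}_{ik}(\tau) = \int_{-\infty}^{+\infty} \psi_g(f) \, G_{ik}(f) \, e^{j 2 \pi f \tau} \, df \qquad (13)$$

where $\psi_g(f)$ is a general frequency weighting filter.

A way to sharpen the cross correlation peak is to
"whiten" the input signals: the choice

$$\psi_p(f) = \frac{1}{|G_{ik}(f)|} \qquad (14)$$

leads to the so-called phase correlation technique [7]. With
such a choice:

$$R^{(p)}_{ik}(\tau) = \int_{-\infty}^{+\infty} \frac{G_{ik}(f)}{|G_{ik}(f)|} \, e^{j 2 \pi f \tau} \, df \qquad (15)$$

and, if the noise signals are uncorrelated, it follows that:

$$R^{(p)}_{ik}(\tau) = \delta(\tau - \delta_{ik}) \qquad (16)$$

where $\delta(\cdot)$ denotes the Dirac delta function.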

It is worth noting that, contrary to the other two delay
estimation techniques, the generalized correlation given by
(15) is independent of the input waveform characteristics,
i.e. in the ideal case it reduces to a delta function
centered at the correct delay $\delta_{ik}$.

In practice, the procedure for estimating the generalized
correlation starts from the computation of the spectra
$S_i(t, f)$ and $S_k(t, f)$ through Fourier transforms
applied to windowed segments of $s_i$ and $s_k$, centered
around time instant $t$. Then, these spectra are used to
estimate the normalized Crosspower-Spectrum:

$$\Phi(t, f) = \frac{S_i^*(t, f) \, S_k(t, f)}{|S_i(t, f)| \, |S_k(t, f)|} \qquad (17)$$

that preserves only information about phase differences
between $s_i$ and $s_k$. Finally, the inverse Fourier
transform $\hat{R}^{(p)}_{ik}(t, \tau)$ of $\Phi(t, f)$ is
computed. Even in a real situation, the resulting function
(defined on the lag axis $\tau$) has a constant energy,
mainly concentrated on the correct delay $\delta_{ik}$. The
Coherence Measure introduced in this case is
$C^{(3)}_{ik}(n, l) = \hat{R}^{(p)}_{ik}(n, l)$.
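
The practical procedure just described can be sketched as follows. To
keep the example self-contained it uses a naive discrete Fourier
transform written with the standard library; a real implementation
would use an FFT routine and an analysis window. The conjugation
convention (conjugate of the first spectrum times the second) is an
assumption chosen so that a positive lag means the second signal is
delayed, and all names are illustrative.

```python
import cmath
import random


def dft(x):
    """Naive O(N^2) discrete Fourier transform, adequate for short frames."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def csp_frame(frame_i, frame_k):
    """Crosspower-spectrum phase of two equal-length frames.

    Returns a list c in which c[l % N] is the coherence at circular lag l.
    """
    n = len(frame_i)
    phi = []
    for a, b in zip(dft(frame_i), dft(frame_k)):
        cross = a.conjugate() * b
        mag = abs(cross)
        # Keep only the phase; bins with no energy carry no information.
        phi.append(cross / mag if mag > 1e-12 else 0.0)
    # Inverse transform; the imaginary part vanishes for real inputs.
    return [sum(phi[f] * cmath.exp(2j * cmath.pi * f * l / n)
                for f in range(n)).real / n
            for l in range(n)]


rng = random.Random(2)
frame_i = [rng.gauss(0.0, 1.0) for _ in range(64)]
frame_k = [frame_i[(t - 5) % 64] for t in range(64)]   # circular delay of 5
cm_column = csp_frame(frame_i, frame_k)
best_lag = max(range(64), key=cm_column.__getitem__)
```

The test frame is a circular shift, so the result is an exact unit
impulse at lag 5; with real frames the delay is not circular and the
peak is lower and slightly spread.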

4.4. Delay estimation algorithm
Given the CM representation, the source position can be
derived in different ways: as pointed out, if the source is
non-moving the CM should consist of a dominant straight line
at the theoretical delay. Hence, starting from the Coherence
Measure $C_{ij}(n, l)$ evaluated for microphone pair
$(i, j)$, for all delays $l$ ($-l_{max} \le l \le l_{max}$)
and time samples $n$ ($1 \le n \le N$), a lag can be
estimated as follows:

$$\hat{\tau}_{ij} = \arg\max_{l} \left[ \sum_{n=1}^{N} C_{ij}(n, l) \right]. \qquad (18)$$



Given two delay hypotheses, obtained by processing two
microphone pairs, the source direction is estimated by
averaging the angles corresponding to these delays, while
the source position is computed as the crossing point
between such directions.
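
The two steps of this section, lag selection and intersection of
directions, can be sketched as follows. The data layout of the CM (a
list of frames, each a list of coherence values per lag), the choice of
measuring angles from the array normal towards positive x, and the use
of the pair midpoints as origins of the directions are assumptions made
for this example.

```python
import math


def estimate_lag(cm, lags):
    """Pick the lag whose coherence, summed over all frames, is largest.

    cm[n][k] is the coherence of frame n at lag lags[k].
    """
    totals = [sum(frame[k] for frame in cm) for k in range(len(lags))]
    return lags[max(range(len(lags)), key=totals.__getitem__)]


def intersect_bearings(p1, angle1_deg, p2, angle2_deg):
    """Crossing point of two direction lines leaving p1 and p2.

    Angles are measured from the y axis (array normal) towards +x.
    Returns None when the two directions are parallel.
    """
    a1, a2 = math.radians(angle1_deg), math.radians(angle2_deg)
    d1 = (math.sin(a1), math.cos(a1))
    d2 = (math.sin(a2), math.cos(a2))
    det = d2[0] * d1[1] - d1[0] * d2[1]
    if abs(det) < 1e-12:
        return None
    rx, ry = p2[0] - p1[0], p2[1] - p1[1]
    t1 = (d2[0] * ry - rx * d2[1]) / det     # distance along the first line
    return (p1[0] + t1 * d1[0], p1[1] + t1 * d1[1])


lag = estimate_lag([[0.1, 0.9, 0.2], [0.2, 0.8, 0.1]], lags=[-1, 0, 1])

# Two hypothetical pair midpoints 75 cm apart and a source at (1, 2).
left, right = (-0.375, 0.0), (0.375, 0.0)
angle_left = math.degrees(math.atan2(1.0 - left[0], 2.0))
angle_right = math.degrees(math.atan2(1.0 - right[0], 2.0))
position = intersect_bearings(left, angle_left, right, angle_right)
```

When the two directions are nearly parallel the crossing point becomes
very sensitive to small angle errors, which is one reason why the
spacing between pairs matters.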

5. EXPERIMENTS AND RESULTS 

Different approaches can be followed to evaluate and com- 
pare performance that can be obtained with the three men- 
tioned analysis techniques. As described in the following, 
two series of experiments were conducted, one operating on 
synthetic signals, the other on real signals, collected in a 
noisy environment. Both series confirmed the superiority 
of the CSP-based technique over the others.

5.1. Simulation Experiments

Given a real signal, originally acquired with a sampling
frequency of 48 kHz from one of the array microphones, and
another, obtained by shifting the original one by a given
delay, two new versions of these signals were artificially
produced by adding both different attenuated and shifted
replicas (to simulate reverberation phenomena) and different
white noise sequences, to better match the mathematical
model introduced with (1).

During the simulation experiments, three real signals
were considered, that is a syllable /pa/, a whistle sound
and a long speech message. For each of them, 300 artificial
signal pairs were generated, using different combinations
of attenuation factors, interchannel delays and white noise
magnitudes.
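
A simplified version of this signal generation can be sketched as
below. It is only an approximation of the procedure described above:
here both channels share the same echo pattern and differ in delay and
noise, whereas the experiments used different replicas per channel. The
function name, its parameters and the tiny example signal are
hypothetical.

```python
import random


def make_test_pair(signal, delay, echoes, noise_std, seed=0):
    """Build two channels; channel 1 lags channel 0 by `delay` samples.

    echoes -- list of (extra_delay_samples, gain) pairs standing in for
              reflections; all shifts must be shorter than the signal.
    """
    rng = random.Random(seed)

    def shifted(x, d):
        # Delay x by d samples, padding with zeros and keeping the length.
        return [0.0] * d + x[:len(x) - d] if d > 0 else list(x)

    def channel(base_delay):
        out = shifted(signal, base_delay)
        for extra, gain in echoes:
            echo = shifted(signal, base_delay + extra)
            out = [o + gain * e for o, e in zip(out, echo)]
        # Independent white noise is drawn for each channel.
        return [o + rng.gauss(0.0, noise_std) for o in out]

    return channel(0), channel(delay)


sig = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
clean0, clean1 = make_test_pair(sig, delay=2, echoes=[], noise_std=0.0)
echo0, echo1 = make_test_pair(sig, delay=2, echoes=[(1, 0.5)], noise_std=0.0)
```

Sweeping `delay`, the echo gains and `noise_std` over a grid gives a set
of signal pairs on which the three techniques can be compared against a
known lag.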

Using the CM functions based on the three given
techniques, delay estimates were computed by the previously
described algorithm. Table 1 shows that the mean delay error
and its standard deviation are very different across the
three techniques, even though they deal with artificial
signals. In particular, the NCC-based technique and,
sometimes, the LMS-based one often provide a delay slightly
different (by one or two samples) from the theoretical one,
which can cause unacceptable performance in terms of
wavefront arrival angle accuracy. This behavior can be
explained as follows: these two techniques are strongly
influenced by the presence of microperiodicities in the
given signals; as a result, the corresponding CM-function
curves show a primary peak that results from
"interpolation" among peaks positioned both at the correct
delay and at other signal-dependent spurious delays.

Table 1. Mean and standard deviations of the lag error
(expressed in samples) obtained applying the three
techniques to 900 artificial signal pairs.
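
The effect of a lag error of one or two samples on the arrival angle can
be quantified with a short calculation. The sketch assumes a far-field
model, a speed of sound of 343 m/s and the 48 kHz sampling frequency and
15 cm spacing used in this work; the function name is illustrative.
Under these assumptions a one-sample error corresponds to roughly 2.7
degrees for a source in front of the pair, and to more for lateral
sources.

```python
import math


def angle_error_deg(true_angle_deg, lag_error, fs=48000.0, spacing=0.15,
                    c=343.0):
    """Change in the estimated angle caused by a lag error in samples."""
    true_sine = math.sin(math.radians(true_angle_deg))
    # One sample of lag changes the sine of the angle by c / (fs * spacing).
    wrong_sine = true_sine + c * lag_error / (fs * spacing)
    wrong_sine = max(-1.0, min(1.0, wrong_sine))
    return math.degrees(math.asin(wrong_sine)) - true_angle_deg


frontal_error = angle_error_deg(0.0, 1)
lateral_error = angle_error_deg(45.0, 1)
```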

As previously mentioned, the CSP-based technique is free
from this influence since it can be considered "independent"
of the given signal characteristics.

5.2. Real Signal Experiments
In a first scenario a database of 97 stimuli was collected in
an acquisition room, in some cases with a background noise
previously recorded in an underground station to simulate
a real noisy environment. In this case, acquisition was
accomplished by using an array consisting of four equispaced
microphones: the distance between microphones was 15 cm.
Approximately half of the stimuli were acquired in the
presence of background noise with an average SNR of 15 dB.
The database consists of screams (the syllable /pa/ uttered
loudly by one speaker), whistle sounds and gun shots.
Acoustic events were generated in the following room
positions (expressed in meters along the x and y axes,
respectively): (0,1), (0,2), (0,3), (1,1), (1,2), (1,3),
(2,2), (2,3), (-1,1), (-1,2), (-1,3), (-2,0.5), (-2,1),
(-2,2), (-2,3).

In terms of wavefront direction angle, performance
confirms the properties observed in the previous discussion.
In particular, given a reasonable tolerance of 5°, the
CSP-based technique ensures performance 20% better than the
LMS-based and 40% better than the NCC-based technique.
However, performance in terms of localization accuracy was
not considered satisfactory, especially using the NCC- and
LMS-based techniques. Table 2 shows performance expressed in
terms of accuracy of the wavefront arrival angle, given
three different tolerances of 2°, 5° and 10°.

Table 2. Percentage of acoustic events that were correctly
identified in terms of wavefront direction angle, given three
different angle tolerances.

To better investigate the localization accuracy allowed by
the CSP-based analysis technique, a second scenario was
conceived using a new array configuration that consists of
two distant microphone pairs: the distance between the
microphones of each pair was 15 cm, while the distance
between the microphone pairs was 75 cm.

A new database of 100 acoustic stimuli was acquired in
an office environment. Acoustic events were generated in
several positions, among them (2,1), (4,5), (-2,0.5),
(-2,1), (-2,5) and (-3,5). In each position we produced a
gun shot, a hand clap, a scream, a short word and a whistle.

Performance in terms of wavefront direction angle was
comparable with that obtained in the first scenario (90%
accuracy given a tolerance of 5°). Concerning localization
accuracy, given a tolerance of (15 cm, 50 cm), 63% of stimuli
were correctly localized. Given a tolerance of (30 cm, 100
cm), 85% of stimuli were correctly localized.

These experiments confirm the good behaviour of the
CSP-based technique but also the importance of the array
geometry. Clearly, the use of more pairs is expected to
provide further substantial improvement.

6. CONCLUSIONS 

This work provided a comparison among three techniques
of acoustic source localization. The one based on
Crosspower-Spectrum Phase has the best properties for the
estimation of the wavefront arrival direction: it is the
most robust both in clean and in non-critical noisy
environments. Issues that will be investigated are new
microphone array configurations as well as the robustness of
this technique in more severe background noise conditions.

Even if the computational aspect was not emphasized in this
work, the proposed technique offers many advantages also
from this point of view: at the moment, the resulting
localization system runs in real time on a DSP board
equipped with two DSP32C and four acquisition channels,
operating with a sampling frequency of 48 kHz and 16 bit
accuracy.

Finally, the CSP-based analysis is being investigated for
other purposes, such as talker tracking and speech
enhancement [8].